About Data Analysis Report

This R Markdown file contains the report of the data analysis carried out for a project on forecasting daily bike rental demand using time series models in R. It covers data exploration, summary statistics, and the construction of the time series models. The final report was completed on Sun Jun 15 05:10:09 2025.

Data Description:

This dataset contains the daily count of rental bike transactions between 2011 and 2012 in the Capital Bikeshare system, with the corresponding weather and seasonal information.

Data Source: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset

Relevant Paper:

Fanaee-T, Hadi, and Gama, João. "Event labeling combining ensemble detectors and background knowledge." Progress in Artificial Intelligence (2013): 1–15. Springer Berlin Heidelberg.

Load and Explore the Data

# Set CRAN mirror
options(repos = c(CRAN = "https://cloud.r-project.org"))

#Install Necessary Packages
install.packages("timetk")
## package 'timetk' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Mash\AppData\Local\Temp\Rtmp2XZLGG\downloaded_packages
install.packages("tidyverse")
## package 'tidyverse' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Mash\AppData\Local\Temp\Rtmp2XZLGG\downloaded_packages
install.packages("lubridate")
## package 'lubridate' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Mash\AppData\Local\Temp\Rtmp2XZLGG\downloaded_packages
install.packages("GGally")
## package 'GGally' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Mash\AppData\Local\Temp\Rtmp2XZLGG\downloaded_packages
install.packages("ggthemes")
## package 'ggthemes' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Mash\AppData\Local\Temp\Rtmp2XZLGG\downloaded_packages
#load Packages
library(timetk)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggthemes)
library(dplyr)

#Load Dataset
data("bike_sharing_daily")
bike_data <- bike_sharing_daily
# View(bike_data)  # interactive viewer; use head(bike_data) in a knitted report
#Exploratory Analysis
summary(bike_data)
##     instant          dteday               season            yr        
##  Min.   :  1.0   Min.   :2011-01-01   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:183.5   1st Qu.:2011-07-02   1st Qu.:2.000   1st Qu.:0.0000  
##  Median :366.0   Median :2012-01-01   Median :3.000   Median :1.0000  
##  Mean   :366.0   Mean   :2012-01-01   Mean   :2.497   Mean   :0.5007  
##  3rd Qu.:548.5   3rd Qu.:2012-07-01   3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :731.0   Max.   :2012-12-31   Max.   :4.000   Max.   :1.0000  
##       mnth          holiday           weekday        workingday   
##  Min.   : 1.00   Min.   :0.00000   Min.   :0.000   Min.   :0.000  
##  1st Qu.: 4.00   1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.000  
##  Median : 7.00   Median :0.00000   Median :3.000   Median :1.000  
##  Mean   : 6.52   Mean   :0.02873   Mean   :2.997   Mean   :0.684  
##  3rd Qu.:10.00   3rd Qu.:0.00000   3rd Qu.:5.000   3rd Qu.:1.000  
##  Max.   :12.00   Max.   :1.00000   Max.   :6.000   Max.   :1.000  
##    weathersit         temp             atemp              hum        
##  Min.   :1.000   Min.   :0.05913   Min.   :0.07907   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.:0.33708   1st Qu.:0.33784   1st Qu.:0.5200  
##  Median :1.000   Median :0.49833   Median :0.48673   Median :0.6267  
##  Mean   :1.395   Mean   :0.49538   Mean   :0.47435   Mean   :0.6279  
##  3rd Qu.:2.000   3rd Qu.:0.65542   3rd Qu.:0.60860   3rd Qu.:0.7302  
##  Max.   :3.000   Max.   :0.86167   Max.   :0.84090   Max.   :0.9725  
##    windspeed           casual         registered        cnt      
##  Min.   :0.02239   Min.   :   2.0   Min.   :  20   Min.   :  22  
##  1st Qu.:0.13495   1st Qu.: 315.5   1st Qu.:2497   1st Qu.:3152  
##  Median :0.18097   Median : 713.0   Median :3662   Median :4548  
##  Mean   :0.19049   Mean   : 848.2   Mean   :3656   Mean   :4504  
##  3rd Qu.:0.23321   3rd Qu.:1096.0   3rd Qu.:4776   3rd Qu.:5956  
##  Max.   :0.50746   Max.   :3410.0   Max.   :6946   Max.   :8714
glimpse(bike_data)
## Rows: 731
## Columns: 16
## $ instant    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ dteday     <date> 2011-01-01, 2011-01-02, 2011-01-03, 2011-01-04, 2011-01-05…
## $ season     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ yr         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mnth       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ holiday    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ weekday    <dbl> 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4,…
## $ workingday <dbl> 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1,…
## $ weathersit <dbl> 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2, 2, 2,…
## $ temp       <dbl> 0.3441670, 0.3634780, 0.1963640, 0.2000000, 0.2269570, 0.20…
## $ atemp      <dbl> 0.3636250, 0.3537390, 0.1894050, 0.2121220, 0.2292700, 0.23…
## $ hum        <dbl> 0.805833, 0.696087, 0.437273, 0.590435, 0.436957, 0.518261,…
## $ windspeed  <dbl> 0.1604460, 0.2485390, 0.2483090, 0.1602960, 0.1869000, 0.08…
## $ casual     <dbl> 331, 131, 120, 108, 82, 88, 148, 68, 54, 41, 43, 25, 38, 54…
## $ registered <dbl> 654, 670, 1229, 1454, 1518, 1518, 1362, 891, 768, 1280, 122…
## $ cnt        <dbl> 985, 801, 1349, 1562, 1600, 1606, 1510, 959, 822, 1321, 126…
str(bike_data)
## spc_tbl_ [731 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ instant   : num [1:731] 1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : Date[1:731], format: "2011-01-01" "2011-01-02" ...
##  $ season    : num [1:731] 1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : num [1:731] 0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : num [1:731] 1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday   : num [1:731] 0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : num [1:731] 6 0 1 2 3 4 5 6 0 1 ...
##  $ workingday: num [1:731] 0 0 1 1 1 1 1 0 0 1 ...
##  $ weathersit: num [1:731] 2 2 1 1 1 1 2 2 1 1 ...
##  $ temp      : num [1:731] 0.344 0.363 0.196 0.2 0.227 ...
##  $ atemp     : num [1:731] 0.364 0.354 0.189 0.212 0.229 ...
##  $ hum       : num [1:731] 0.806 0.696 0.437 0.59 0.437 ...
##  $ windspeed : num [1:731] 0.16 0.249 0.248 0.16 0.187 ...
##  $ casual    : num [1:731] 331 131 120 108 82 88 148 68 54 41 ...
##  $ registered: num [1:731] 654 670 1229 1454 1518 ...
##  $ cnt       : num [1:731] 985 801 1349 1562 1600 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   instant = col_double(),
##   ..   dteday = col_date(format = ""),
##   ..   season = col_double(),
##   ..   yr = col_double(),
##   ..   mnth = col_double(),
##   ..   holiday = col_double(),
##   ..   weekday = col_double(),
##   ..   workingday = col_double(),
##   ..   weathersit = col_double(),
##   ..   temp = col_double(),
##   ..   atemp = col_double(),
##   ..   hum = col_double(),
##   ..   windspeed = col_double(),
##   ..   casual = col_double(),
##   ..   registered = col_double(),
##   ..   cnt = col_double()
##   .. )
#Correlation between Temperature and Total Rental
cor(bike_data$temp, bike_data$cnt)
## [1] 0.627494
cor(bike_data$atemp, bike_data$cnt)
## [1] 0.6310657
#Correlation between Temperature and Casual/Registered User
cor(bike_data$temp, bike_data$registered)
## [1] 0.540012
cor(bike_data$temp, bike_data$casual)
## [1] 0.5432847
#Extract Date Components
bike_data <- bike_sharing_daily %>%
  mutate(
    date = as.Date(dteday),  # Create a new proper date column
    month = month(date, label = TRUE),
    year = year(date),
    weekday = wday(date, label = TRUE),
    weekend = ifelse(weekday %in% c("Sat", "Sun"), "Weekend", "Weekday")
  )
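As a quick sanity check on the derived calendar fields (a sketch, assuming `bike_data` as created above), the weekend flag can be cross-tabulated against the weekday labels:

```r
# Each "Weekend" row should pair only with Sat or Sun
bike_data %>% count(weekend, weekday)
```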

#Mean and Median Temperature by Season
bike_data %>%
  group_by(season) %>%
  summarise(mean_temp=mean(temp), median_temp=median(temp))
## # A tibble: 4 × 3
##   season mean_temp median_temp
##    <dbl>     <dbl>       <dbl>
## 1      1     0.298       0.286
## 2      2     0.544       0.562
## 3      3     0.706       0.715
## 4      4     0.423       0.409
#Monthly Summary of Temperature, Humidity, Windspeed, and Total Rentals 
bike_data %>%
  mutate(month=month(date, label=TRUE)) %>%
  group_by(month) %>%
  summarise(across(c(temp,atemp,hum,windspeed,cnt), mean, .names="mean_{col}"))
## # A tibble: 12 × 6
##    month mean_temp mean_atemp mean_hum mean_windspeed mean_cnt
##    <ord>     <dbl>      <dbl>    <dbl>          <dbl>    <dbl>
##  1 Jan       0.236      0.240    0.586          0.206    2176.
##  2 Feb       0.299      0.300    0.567          0.216    2655.
##  3 Mar       0.391      0.382    0.588          0.223    3692.
##  4 Apr       0.470      0.457    0.588          0.234    4485.
##  5 May       0.595      0.566    0.689          0.183    5350.
##  6 Jun       0.684      0.638    0.576          0.185    5772.
##  7 Jul       0.755      0.704    0.598          0.166    5564.
##  8 Aug       0.709      0.651    0.638          0.173    5664.
##  9 Sep       0.616      0.579    0.715          0.166    5767.
## 10 Oct       0.485      0.472    0.694          0.175    5199.
## 11 Nov       0.369      0.367    0.625          0.184    4247.
## 12 Dec       0.324      0.326    0.666          0.177    3404.
#Box Plot of Temperature by Season
boxplot(temp~season, data=bike_data,main="Temperature by Season", col="skyblue")

#Correlation Matrix of Continuous Variables
bike_data %>%
  select(temp,atemp,hum,windspeed,casual,registered,cnt) %>%
  ggpairs(title="Correlation Matrix of Key Variables")

Create Interactive Time Series Plots

#Time Series of Total Rentals
bike_data %>%
  plot_time_series(date,cnt, .interactive=TRUE, .title="Daily Bike Rental Over Time")
#Registered vs Casual Over Time
bike_data %>%
  select(date, registered, casual) %>%
  pivot_longer(cols = -date, names_to = "type", values_to = "rentals") %>%
  plot_time_series(date, rentals, .color_var = type, .interactive = TRUE, .title = "Registered vs Casual Users")
#Seasonal Diagnostics
bike_data %>%
  plot_seasonal_diagnostics(date, cnt)
#Anomaly Detection
bike_data %>%
  plot_anomaly_diagnostics(date,cnt)
## frequency = 7 observations per 1 week
## trend = 92 observations per 3 months
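If a table of the flagged days is more useful than the plot, timetk's `tk_anomaly_diagnostics()` returns the same decomposition as a tibble (a sketch, assuming `bike_data` with `date` and `cnt` as above):

```r
# List the dates flagged as anomalous instead of plotting them
bike_data %>%
  tk_anomaly_diagnostics(date, cnt) %>%
  filter(anomaly == "Yes") %>%
  select(date, observed, anomaly)
```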

Smooth Time Series Data

#Load Packages
install.packages("forecast")
## package 'forecast' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Mash\AppData\Local\Temp\Rtmp2XZLGG\downloaded_packages
install.packages("TTR")
## package 'TTR' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Mash\AppData\Local\Temp\Rtmp2XZLGG\downloaded_packages
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(TTR)

#Convert to Time Series Object
ts_data <- ts(bike_data$cnt, frequency=365)
ts_clean <- tsclean(ts_data)

#Plot Original vs Clean Data
plot(ts_data, col="red", main="Original vs Cleaned Time Series")
lines(ts_clean, col="blue")

#Simple Exponential Smoothing
ses_model <- HoltWinters(ts_clean, beta=FALSE, gamma=FALSE)
plot(ses_model, main="Simple Exponential Smoothing")

#Simple Moving Average with Order 10
sma_10 <- SMA(ts_clean, n=10)
plot.ts(sma_10, main="10-Day Moving Average", col="darkgreen")

Decompose and Assess the Stationarity of Time Series Data

#Load Package for Stationarity Testing
install.packages("tseries")
## package 'tseries' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'tseries'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying
## C:\Users\Mash\AppData\Local\Programs\R\R-4.5.0\library\00LOCK\tseries\libs\x64\tseries.dll
## to
## C:\Users\Mash\AppData\Local\Programs\R\R-4.5.0\library\tseries\libs\x64\tseries.dll:
## Permission denied
## Warning: restored 'tseries'
## 
## The downloaded binary packages are in
##  C:\Users\Mash\AppData\Local\Temp\Rtmp2XZLGG\downloaded_packages
library(tseries)

#Decompose Time Series (named decomp to avoid shadowing the base decompose() function)
decomp <- decompose(ts_clean)
plot(decomp)

#Remove Seasonality
adjusted_ts <- ts_clean - decomp$seasonal
plot(adjusted_ts, main="Seasonally Adjusted Series")

adf.test(adjusted_ts)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  adjusted_ts
## Dickey-Fuller = -3.9571, Lag order = 9, p-value = 0.01114
## alternative hypothesis: stationary
#Plot ACF and PACF
acf(adjusted_ts)

pacf(adjusted_ts)

#Differencing (shown for reference; the ADF test above already indicates the adjusted series is stationary)
diff_ts <- diff(adjusted_ts)
plot(diff_ts, main="Differenced Series")
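Since the ADF test above (p ≈ 0.011) already suggests the adjusted series is stationary, differencing may be unnecessary. `forecast::ndiffs()` offers an automatic check (a sketch, assuming `adjusted_ts` from the decomposition step):

```r
# Estimate how many differences are required for stationarity;
# a result of 0 would agree with the ADF test
ndiffs(adjusted_ts)
```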

Fit and Forecast Time Series Data Using ARIMA Models

#Auto ARIMA Model
auto_model <- auto.arima(ts_clean)
summary(auto_model)
## Series: ts_clean 
## ARIMA(1,0,3)(0,1,0)[365] with drift 
## 
## Coefficients:
##          ar1      ma1      ma2      ma3   drift
##       0.9683  -0.5912  -0.1279  -0.0937  5.7116
## s.e.  0.0224   0.0571   0.0617   0.0576  0.8318
## 
## sigma^2 = 986021:  log likelihood = -3042.81
## AIC=6097.63   AICc=6097.86   BIC=6121.05
## 
## Training set error measures:
##                   ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 5.85301 697.8113 385.8648 -2.699882 9.189324 0.1694626
##                      ACF1
## Training set -0.003587803
#Manual ARIMA Example (Can be Optimised)
manual_model <- arima(ts_clean, order=c(1,1,1))
summary(manual_model)
## 
## Call:
## arima(x = ts_clean, order = c(1, 1, 1))
## 
## Coefficients:
##          ar1      ma1
##       0.3692  -0.8751
## s.e.  0.0437   0.0213
## 
## sigma^2 estimated as 661725:  log likelihood = -5928.18,  aic = 11862.37
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE     MASE       ACF1
## Training set 11.13247 812.9082 588.1202 -6.429274 19.42943 0.894064 0.01109983
#Residual Diagnostic
shapiro.test(residuals(auto_model))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(auto_model)
## W = 0.83801, p-value < 2.2e-16
acf(residuals(auto_model))

pacf(residuals(auto_model))
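Beyond the Shapiro-Wilk test and the ACF/PACF plots, a Ljung-Box test gives a formal check for leftover autocorrelation in the residuals (a sketch; the lag and fitdf values are illustrative choices, with fitdf set to the number of estimated ARMA coefficients):

```r
# Ljung-Box test: H0 = residuals are independently distributed
Box.test(residuals(auto_model), lag = 14, type = "Ljung-Box", fitdf = 4)
```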

#Forecast Next 25 Days
forecast_auto <- forecast(auto_model, h=25)
forecast_manual <- forecast(manual_model, h=25)

#Plot Forecast
plot(forecast_auto, main="Forecast with Auto ARIMA")

plot(forecast_manual, main="Forecast with Manual ARIMA")
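The error measures above are computed on the training set, which favours the more flexible model. A holdout comparison is more informative; the sketch below (assuming `ts_clean` as above) withholds the final 25 days and scores both specifications on them:

```r
# Refit both models on all but the last 25 days, then score on the holdout
n     <- length(ts_clean)
train <- window(ts_clean, end = time(ts_clean)[n - 25])
test  <- window(ts_clean, start = time(ts_clean)[n - 24])

auto_fit   <- auto.arima(train)
manual_fit <- arima(train, order = c(1, 1, 1))

accuracy(forecast(auto_fit, h = 25), test)    # compare the "Test set" rows
accuracy(forecast(manual_fit, h = 25), test)
```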

Heatmap of Season vs Weekday Rentals

#Visualise Patterns Across Seasons and Weekdays
bike_data %>%
  group_by(season, weekday) %>%
  summarise(avg_count = mean(cnt), .groups = "drop") %>%
  ggplot(aes(x = weekday, y = season, fill = avg_count)) +
  geom_tile() +
  scale_fill_viridis_c() +
  labs(title = "Heatmap: Average Rentals by Weekday and Season",
       x = "Weekday", y = "Season", fill = "Avg Rentals") +
  theme_minimal()

Professional Time Series Plot

# ggthemes was already installed and loaded above
library(ggthemes)

bike_data %>%
  ggplot(aes(x = date, y = cnt)) +
  geom_line(color = "steelblue", linewidth = 1) +
  labs(title = "Bike Rentals Over Time",
       subtitle = "Capital Bikeshare Program (2011–2012)",
       x = "Date", y = "Rental Count") +
  theme_economist()

Findings & Conclusions

This time series forecasting project modeled daily bike rental demand for the Capital Bikeshare system. Data cleaning, decomposition, and stationarity testing established a stable series structure suited to ARIMA modeling. Fitting both a manually specified and an automatically selected ARIMA model allowed comparative diagnostics; the Auto ARIMA model was preferred because of its lower training error metrics (RMSE ≈ 698 vs. ≈ 813) and better-behaved residuals.

Overall, the analysis demonstrates the feasibility of data-driven demand forecasting in urban mobility and highlights how weather and calendar features influence public transportation usage. These insights are valuable for urban planners and operations teams seeking to optimize bike distribution, plan fleet expansion, or design adaptive pricing strategies.

The project not only meets its forecasting goals but also illustrates the importance of statistical rigor in turning raw data into actionable insight.